1. Why study R&D activities?

Here, I investigate what drives a company to spend on and undertake research and development (R&D) activities. R&D expenditure is a recurrent topic in discussions about innovation in Australia, given the nation’s moderate R&D intensity compared to similar countries. Two main positions underlie these discussions: one argues for boosting R&D activities to stimulate the generation of novel products and services; the other argues that R&D covers only a portion of the innovation activities in an economy, and that we are better off boosting non-R&D activities to stimulate novel offerings.

The resolution of this tension is likely to lie somewhere between the two positions. For some companies, R&D is a vital component of the innovation process. For others, R&D is not the best way to innovate. To understand which is which, we need a clearer picture of who is doing R&D in Australia and why.

To bridge this knowledge gap, I develop multiple classification models that use a battery of variables from ASX-listed companies to predict whether a company undertakes R&D activities. I assess the accuracy of each model and examine its most important features in order to identify the key factors driving R&D undertaking.

2. Data and variables

The data were collected from three different sources: S&P’s Capital IQ, Morningstar’s DatAnalysis, and IP Australia’s IPGOD. To maximise the coverage of company data, R&D expenditure values, as well as a range of organisational and financial indicators, were sourced from both Capital IQ and DatAnalysis and aggregated into a single dataset. Patent, trademark and design data were then integrated into the dataset using a combination of fuzzy name matching and company ABN (Australian Business Number). After integrating the three data sources, my final sample comprised 1387 companies that are either currently listed or were listed at some point in the past.
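The matching step can be sketched roughly as follows. This is a minimal illustration in base R, not the procedure actually used: the two toy data frames, the helper names `normalise_name` and `match_company`, the 0.2 distance threshold, and the choice to set unmatched companies to zero patents are all assumptions made for the example.

```r
# Hypothetical toy inputs: column names and values are illustrative only.
financials <- data.frame(
  company_name = c("Acme Mining Ltd", "Blue Sky Biotech Limited", "Coral Retail Group"),
  ABN = c("11111111111", NA, "33333333333"),
  stringsAsFactors = FALSE
)
ip_records <- data.frame(
  applicant = c("ACME MINING LIMITED", "Blue Sky Bio-tech Ltd.", "Delta Software Pty Ltd"),
  ABN = c("11111111111", "22222222222", "44444444444"),
  no_patents = c(2, 7, 1),
  stringsAsFactors = FALSE
)

# Normalise names: lower case, strip punctuation and common legal suffixes.
normalise_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", "", x)
  x <- gsub("\\b(limited|ltd|pty|group)\\b", "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Return the row of `ip` matching one company, or NA if nothing is close enough.
match_company <- function(name, abn, ip, max_rel_dist = 0.2) {
  # Step 1: prefer an exact ABN match when an ABN is available.
  if (!is.na(abn)) {
    hit <- which(ip$ABN == abn)
    if (length(hit) > 0) return(hit[1])
  }
  # Step 2: fall back to fuzzy matching on normalised names (edit distance).
  clean <- normalise_name(name)
  d <- utils::adist(clean, normalise_name(ip$applicant))
  rel <- d / max(nchar(clean), 1) # distance relative to name length
  best <- which.min(rel)
  if (rel[best] <= max_rel_dist) best else NA_integer_
}

idx <- mapply(match_company, financials$company_name, financials$ABN,
              MoreArgs = list(ip = ip_records), USE.NAMES = FALSE)

# Assumption: companies with no match are treated as holding zero patents.
financials$no_patents <- ifelse(is.na(idx), 0, ip_records$no_patents[idx])
```

Matching on ABN first keeps the fuzzy step as a fallback, which limits false matches between companies with similar names.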

The following table lists the variables (predictors) developed and used for the classification models. It also includes a dichotomous variable indicating whether or not the company reported R&D expenditures in any of the last three years (the outcome variable). These variables were selected after a review of relevant academic literature on corporate R&D and strategic management.

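kable(describe(data)) #summary statistics; variables marked with * are categorical and were converted to numeric codes
variable vars n mean sd median trimmed mad min max range skew kurtosis se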
ticker* 1 1387 694.0000000 4.005367e+02 694.000000 694.0000000 514.462200 1.000000 1387.0 1386.0 0.0000000 -1.2025958 10.7548439
company_name* 2 1387 694.0000000 4.005367e+02 694.000000 694.0000000 514.462200 1.000000 1387.0 1386.0 0.0000000 -1.2025958 10.7548439
ABN* 3 1387 693.1824081 3.996539e+02 694.000000 693.2853285 512.979600 1.000000 1384.0 1383.0 -0.0021396 -1.2016739 10.7311389
business_description* 4 1387 694.0000000 4.005367e+02 694.000000 694.0000000 514.462200 1.000000 1387.0 1386.0 0.0000000 -1.2025958 10.7548439
trading_status* 5 1387 2.9257390 2.730712e-01 3.000000 3.0000000 0.000000 1.000000 3.0 2.0 -3.6613905 13.1848840 0.0073323
company_status* 6 1387 2.0519106 2.376266e-01 2.000000 2.0000000 0.000000 1.000000 4.0 3.0 4.4930192 22.1814044 0.0063805
country_of_incorporation* 7 1387 1.0273973 1.632969e-01 1.000000 1.0000000 0.000000 1.000000 2.0 1.0 5.7840904 31.4783987 0.0043847
headquarters_city* 8 1387 196.6683490 8.590309e+01 208.000000 205.7668767 85.990800 1.000000 302.0 301.0 -0.7728700 -0.5148393 2.3065909
headquarters_state* 9 1387 6.2321557 2.920946e+00 8.000000 6.4185419 1.482600 1.000000 9.0 8.0 -0.5064902 -1.4651933 0.0784305
headquarters_country* 10 1387 1.0346071 1.828484e-01 1.000000 1.0000000 0.000000 1.000000 2.0 1.0 5.0868097 23.8928605 0.0049097
no_geographic_segments 11 1387 2.0151406 1.916113e+00 1.000000 1.5841584 0.000000 1.000000 19.0 18.0 3.2239594 15.3850767 0.0514497
primary_industry* 12 1387 57.2126893 3.659888e+01 45.000000 54.6579658 35.582400 1.000000 137.0 136.0 0.5307780 -0.8432416 0.9827194
GICS_sector* 13 1387 6.6647441 2.619593e+00 7.000000 6.9360936 2.965200 1.000000 11.0 10.0 -0.6135100 -0.7950801 0.0703389
GICS_industry_group* 14 1387 12.5544340 5.114324e+00 14.000000 12.6372637 4.447800 1.000000 24.0 23.0 -0.1676416 -0.4823004 0.1373251
GICS_industry* 15 1387 37.9250180 1.577283e+01 46.000000 39.0351035 8.895600 1.000000 66.0 65.0 -0.6957283 -0.7035707 0.4235176
year_founded 16 1387 1991.4390771 2.874634e+01 2000.000000 1997.4797480 10.378200 1817.000000 2020.0 203.0 -3.0176919 10.5398807 0.7718702
ASX_listing_year 17 1387 2001.9329488 1.336638e+01 2005.000000 2003.5040504 10.378200 1885.000000 2020.0 135.0 -1.9366158 8.1045081 0.3589016
no_business_segments 18 1387 2.5818313 2.683048e+00 1.000000 2.0045005 0.000000 1.000000 25.0 24.0 2.8292692 12.1476523 0.0720427
pct_external_directors 19 1387 72.1888464 1.929793e+01 75.000000 73.8982808 12.350058 0.000000 100.0 100.0 -0.8846479 0.9153692 0.5181703
no_employees 20 1387 1363.4181687 9.874289e+03 32.000000 109.3267327 40.030200 0.000000 217000.0 217000.0 16.7851779 336.5368781 265.1353268
market_capitalisation 21 1387 1308.1833888 7.437476e+03 51.065809 160.0949210 63.837763 0.128263 134670.9 134670.8 11.8075302 169.2407139 199.7042610
total_revenue 22 1387 670.4267977 3.860641e+03 2.850464 54.9747032 4.226097 -42.101000 63850.0 63892.1 11.5099131 163.1198928 103.6623891
total_assets 23 1387 4267.5353311 5.146816e+04 28.302921 145.8290219 37.972588 0.008000 1014060.0 1014060.0 17.5617697 317.9114900 1381.9756451
intangible_assets 24 1387 113.2225884 9.172999e+02 0.000000 3.4967767 0.000000 0.000000 25940.0 25940.0 19.2300782 479.6826627 24.6304946
no_designs 25 1387 1.0511896 1.207478e+01 0.000000 0.0000000 0.000000 0.000000 251.0 251.0 15.6904638 264.4125215 0.3242210
no_patents 26 1387 3.8464311 2.634925e+01 0.000000 0.1836184 0.000000 0.000000 566.0 566.0 13.4783105 222.2241803 0.7075059
no_trademarks 27 1387 4.6236482 3.393999e+01 0.000000 0.4275428 0.000000 0.000000 745.0 745.0 14.7753680 257.2173061 0.9113255
no_corporate_investments 28 1387 3.3100216 3.576487e+00 2.000000 2.7848785 1.482600 0.000000 64.0 64.0 4.9075035 62.5808992 0.0960326
no_sponsors 29 1387 3.4131218 3.395554e+00 2.000000 2.7398740 1.482600 1.000000 47.0 46.0 3.0857192 22.0892280 0.0911743
no_industries_associated 30 1387 9.7087239 6.360180e+00 8.000000 8.6849685 5.930400 2.000000 57.0 55.0 2.0036828 6.5652500 0.1707777
no_products 31 1387 3.8904110 1.543815e+00 5.000000 4.0576058 0.000000 1.000000 13.0 12.0 -0.4550165 0.1908420 0.0414531
no_business_relationships 32 1387 13.3280461 8.292181e+00 12.000000 13.3186319 10.378200 1.000000 25.0 24.0 0.1943857 -1.3668602 0.2226540
no_competitors 33 1387 18.8211968 1.691020e+01 16.000000 16.0819082 0.000000 1.000000 254.0 253.0 5.6446047 46.9564381 0.4540573
no_customers 34 1387 6.8759913 6.923029e+00 4.000000 5.4374437 2.965200 1.000000 25.0 24.0 1.6416600 1.5184286 0.1858908
no_suppliers 35 1387 6.0814708 5.882924e+00 4.000000 4.8514851 2.965200 1.000000 25.0 24.0 1.9317810 3.2288133 0.1579629
no_strategic_alliances 36 1387 5.6128335 5.214663e+00 4.000000 4.5562556 2.965200 1.000000 25.0 24.0 2.1839454 4.7848863 0.1400193
no_competitors_na 37 1387 0.6532084 4.761204e-01 1.000000 0.6912691 0.000000 0.000000 1.0 1.0 -0.6431058 -1.5875579 0.0127843
no_strategic_alliances_na 38 1387 0.2891132 4.535141e-01 0.000000 0.2367237 0.000000 0.000000 1.0 1.0 0.9293416 -1.1371425 0.0121773
no_employees_na 39 1387 0.2790195 4.486789e-01 0.000000 0.2241224 0.000000 0.000000 1.0 1.0 0.9843174 -1.0318617 0.0120475
no_customers_na 40 1387 0.2472963 4.315961e-01 0.000000 0.1845185 0.000000 0.000000 1.0 1.0 1.1701751 -0.6311437 0.0115888
year_founded_na 41 1387 0.1427541 3.499481e-01 0.000000 0.0540054 0.000000 0.000000 1.0 1.0 2.0402354 2.1641225 0.0093965
no_suppliers_na 42 1387 0.1398702 3.469774e-01 0.000000 0.0504050 0.000000 0.000000 1.0 1.0 2.0743143 2.3044427 0.0093167
no_geographic_segments_na 43 1387 0.1369863 3.439569e-01 0.000000 0.0468047 0.000000 0.000000 1.0 1.0 2.1092874 2.4508617 0.0092356
intangible_assets_na 44 1387 0.0713771 2.575465e-01 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 3.3261090 9.0695417 0.0069154
no_business_relationships_na 45 1387 0.0540735 2.262443e-01 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 3.9391481 13.5266414 0.0060749
no_products_na 46 1387 0.0353280 1.846742e-01 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 5.0287144 23.3047728 0.0049587
no_business_segments_na 47 1387 0.0216294 1.455227e-01 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 6.5697754 41.1916493 0.0039074
total_revenue_na 48 1387 0.0115357 1.068215e-01 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 9.1388328 81.5770814 0.0028683
pct_external_directors_na 49 1387 0.0036049 5.995410e-02 0.000000 0.0000000 0.000000 0.000000 1.0 1.0 16.5472211 272.0066398 0.0016098
has_RD_exp* 50 1387 1.2667628 4.424269e-01 1.000000 1.2088209 0.000000 1.000000 2.0 1.0 1.0535935 -0.8905812 0.0118796
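The variables ending in `_na` in the table look like missing-value indicators (1 where the original value was missing). The text does not say how the missing values themselves were filled, so the sketch below assumes median imputation purely for illustration; the helper name `add_na_flag` and the toy data are hypothetical.

```r
# Hypothetical helper: record which values were missing, then fill them.
# Assumption: missing values are replaced by the column median.
add_na_flag <- function(df, col) {
  flag <- paste0(col, "_na")
  df[[flag]] <- as.integer(is.na(df[[col]]))            # 1 = value was missing
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
  df
}

toy <- data.frame(no_customers = c(3, NA, 10, 4, NA))
toy <- add_na_flag(toy, "no_customers")
toy
```

Keeping the flag alongside the filled value lets a model distinguish genuine values from imputed ones.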

3. Exploratory analysis

The following figures show the bivariate relationships between the predictors and R&D undertaking, organised in three groups: company characteristics, stakeholder variables, and company performance.

Regarding company characteristics, we can see that companies incorporated and/or headquartered outside Australia are slightly more likely to undertake R&D. There are also some differences across the states in which companies are headquartered. The biggest differences can be seen across GICS (Global Industry Classification Standard) sectors and industries. Companies with a larger number of geographic segments are also slightly more likely to undertake R&D. So, the industry of operation and internationalisation emerge as potential key predictors, consistent with what theory says about R&D expenditure.

Regarding stakeholder variables, companies with more customers and business relationships are more likely to undertake R&D. In principle, this is consistent with theory, which suggests that stronger relationships with customers and suppliers are key drivers of R&D activities. Companies with fewer investments in other companies and more sponsors (i.e. companies investing in them) are more likely to undertake R&D, suggesting that R&D-intensive companies are more likely to receive support from external organisations than to provide it. A greater percentage of external directors also seems to be related to R&D undertaking, which aligns with the insights from the company characteristic predictors.

As for company performance, companies with R&D show a greater market capitalisation, which is consistent with both theoretical and practical knowledge on the topic. Firms with R&D are also more likely to have moderate revenue and intangible assets, which suggests a sweet spot in revenues and intangibles that favours R&D undertaking. Lastly, companies with more designs, patents and trademarks seem to be more likely to undertake R&D.

Let’s see if these insights hold when building the predictive models.

4. Building the classifiers

Three classification models were developed to predict whether or not a company undertakes R&D, using random forest, logistic regression, and naive Bayes methods. These three are among the most widely used machine learning methods in the data science community.

First, I build the training set used to train each classifier, and the testing set used to evaluate its performance:

set.seed(1234) #for reproducibility

#Build training and testing sets:
data_ready <- data[,-c(1:4,8,12,15)] #remove firm metadata and categorical variables with extreme number of categories

sample_size <- floor(0.75 * nrow(data_ready)) #size of training dataset = 75% of data
train_indices <- sample(nrow(data_ready), size = sample_size) #obtain list of indices for training dataset
train_data <- data_ready[train_indices, ] #build training dataset
test_data <- data_ready[-train_indices, ] #build testing dataset
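Because the outcome is imbalanced, it helps to check the class proportions in each split and to compute a majority-class baseline before reading the accuracies reported later. The sketch below uses a synthetic outcome vector rather than the real data. Its counts (370 "yes", 1017 "no") are inferred from the mean of `has_RD_exp` in the summary table (about 1.27 on a 1/2 coding), under the assumption that code 2 corresponds to "yes".

```r
set.seed(1234) #for reproducibility

# Synthetic stand-in for the outcome (assumed counts, see the text above).
outcome <- factor(rep(c("no", "yes"), times = c(1017, 370)))

n <- length(outcome)
train_indices <- sample(n, size = floor(0.75 * n)) # same 75/25 split rule
train_y <- outcome[train_indices]
test_y <- outcome[-train_indices]

# Class proportions should be similar in both splits.
prop.table(table(train_y))
prop.table(table(test_y))

# Baseline: always predict the most frequent class seen in training.
majority_class <- names(which.max(table(train_y)))
baseline_acc <- mean(test_y == majority_class) * 100
baseline_acc
```

Under these assumptions, a classifier that always predicts "no" would already be right for roughly 73% of companies overall (1017 of 1387), so the accuracy of each model is best judged against that figure rather than against 50%.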

Then, I train and test each model, and assess how important each predictor is for the model:

Random forest-based classifier:

rf_model1 <- randomForest::randomForest(has_RD_exp ~., data = train_data, ntree = 5000) #build model using 5000 trees

#calculate accuracy with testing set:
pred <- predict(rf_model1, newdata = test_data[,!(names(test_data) == "has_RD_exp")]) #use model to predict labels
cm <- table(test_data$has_RD_exp, pred) #confusion matrix using true outcome vs predicted outcome
rf_model1_acc <- sum(diag(cm)) / (sum(cm)) * 100 #the cases predicted correctly, divided by total cases

Result: The model accuracy is 85.01%. Consistent with the exploratory analysis and past literature, the industry predictors are the most dominant. Interestingly, the amount of intangibles, revenue, assets, market capitalisation and patents are also important predictors of R&D.
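The importance ranking referred to here can be read from a fitted forest with `randomForest::importance()`. The sketch below uses a small synthetic dataset (the variables `x1`, `x2` and `noise` are made up) so that it runs on its own; it is an illustration of the call, not a reproduction of the ranking reported above.

```r
set.seed(1234) #for reproducibility

# Synthetic data: y depends strongly on x1, weakly on x2, not at all on noise.
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), noise = rnorm(200))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(200, sd = 0.5) > 0, "yes", "no"))

rf_toy <- randomForest::randomForest(y ~ ., data = toy, ntree = 500)

# For classification, the default importance measure is the mean decrease in Gini impurity.
imp <- randomForest::importance(rf_toy)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE] # most important first
```

On the model built in this section, the equivalent calls would be `randomForest::importance(rf_model1)` or, for a plot, `randomForest::varImpPlot(rf_model1)`.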

Logistic regression-based classifier:

lr_model1 <- glm(has_RD_exp ~., data = train_data, family = "binomial") %>%
  MASS::stepAIC(trace = FALSE) # performs stepwise variable selection

#calculate accuracy with testing set:
pred2 <- predict(lr_model1, newdata = test_data[,!(names(test_data) == "has_RD_exp")], type = "response") #use model to predict probabilities (the default, type = "link", returns log-odds)
pred_classes2 <- ifelse(pred2 > 0.5, "yes", "no") #turn predicted probabilities in [0,1] into dichotomous labels
cm2 <- table(test_data$has_RD_exp, pred_classes2) #confusion matrix using true outcome vs predicted outcome
lr_model1_acc <- sum(diag(cm2)) / (sum(cm2)) * 100 #the cases predicted correctly, divided by total cases

Result: The model accuracy is 84.44%. As with the random forest classifier, the industry variables are the most dominant predictors of R&D undertaking. The number of patents and revenue are also important predictors, again as in the random forest. Interestingly, the number of geographic segments, company age and the number of strategic alliances emerge as important predictors in this model.
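A note on the prediction scale: for a `glm` object, `predict()` returns log-odds by default, and probabilities only when `type = "response"` is requested. A 0.5 cut-off means different things on the two scales, as the following sketch on synthetic data shows (the data and variable names are made up for the example).

```r
set.seed(1234) #for reproducibility

# Synthetic data: the probability of "yes" rises with x.
toy <- data.frame(x = rnorm(300))
toy$y <- factor(ifelse(runif(300) < plogis(1.5 * toy$x), "yes", "no"))

fit <- glm(y ~ x, data = toy, family = "binomial")

log_odds <- predict(fit, newdata = toy)                 # default: type = "link"
prob <- predict(fit, newdata = toy, type = "response")  # probabilities in [0,1]

# The two scales are linked by the logistic function.
all.equal(unname(plogis(log_odds)), unname(prob))

# A 0.5 cut-off on the log-odds scale corresponds to a probability of about 0.62.
plogis(0.5)

classes_link <- ifelse(log_odds > 0.5, "yes", "no")
classes_prob <- ifelse(prob > 0.5, "yes", "no")
sum(classes_link != classes_prob) # cases labelled differently by the two cut-offs
```

Thresholding log-odds at 0.5 is therefore stricter than thresholding probabilities at 0.5, and it labels fewer cases as "yes".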

Naive Bayes-based classifier:

nb_model1 <- caret::train(x = as.data.frame(train_data[,!(names(train_data) == "has_RD_exp")]),
                          y = train_data$has_RD_exp,
                          method = "naive_bayes",
                          preProcess = c("scale", "center"),
                          trControl = caret::trainControl(method = "cv", number = 10)) #10-fold cross-validation on the training set

#calculate accuracy with testing set:
pred3 <- predict(nb_model1, newdata = test_data[,!(names(test_data) == "has_RD_exp")]) #predict labels in test set using model built
cm3 <- table(test_data$has_RD_exp, pred3) #confusion matrix using true outcome vs predicted outcome
nb_model1_acc <- sum(diag(cm3)) / (sum(cm3)) * 100 #the cases predicted correctly, divided by total cases

Result: The model accuracy is 80.69%. The most interesting insight from this classifier is that industry is not the most dominant factor; intangible assets and the number of patents are. Revenue, trademarks and the number of customers are close seconds.

5. Model comparison

When comparing the accuracy of each model, the random forest classifier emerges as the highest performer. However, the performance of the logistic regression classifier was very close. Let’s look at the ROC (receiver operating characteristic) curve and corresponding AUC (area under the curve) of each classifier to assess performance further:
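For reference, the AUC can be computed directly from predicted scores and true labels. The sketch below uses the rank-based (Mann-Whitney) formula in base R, which is not necessarily how the figure in this section was produced; the function name `auc_rank` and the toy scores are illustrative.

```r
# Rank-based AUC: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc_rank <- function(scores, labels, positive = "yes") {
  pos <- labels == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(scores) # ties receive average ranks
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 8 of the 9 positive-negative pairs are ordered correctly.
labels <- c("no", "no", "yes", "no", "yes", "yes")
scores <- c(0.1, 0.4, 0.35, 0.2, 0.8, 0.7)
auc_rank(scores, labels) # 8/9, about 0.889
```

For the three classifiers, the scores would be the predicted probabilities of the positive class, for example `predict(rf_model1, newdata = test_data, type = "prob")[, "yes"]` for the random forest, assuming the outcome levels are "no" and "yes".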

6. Discussion

The curves make clear that both the random forest and logistic regression classifiers outperform the naive Bayes classifier. The random forest classifier has the greater AUC, and it performs better than the logistic regression classifier in the middle section of the curves (where the false positive rates are between 0.1 and 0.6).

Given the superiority of the random forest classifier, we go back to its most important predictors to answer our research question. Clearly, the industry in which the company operates is a dominant feature driving R&D undertaking. This insight suggests that Australia’s R&D intensity is a reflection of the industry structure underpinning the economy. In fact, if we go back to the exploratory analysis, we see that the Materials, Energy and Diversified Financials industries (the three most populous industries in Australia) show comparatively low counts of companies with R&D.

When we consider the top 10 predictors (excluding GICS industry group and sector), interesting insights emerge too. The amount of intangible assets, total revenue and total assets all play a role in predicting R&D undertaking. The curves from the exploratory analysis indicate a group of companies heavy in assets (including intangibles) and revenues that do not undertake R&D. This can be interpreted in three ways: they outsource innovation activities to other organisations, undertake other forms of innovation, or do not innovate. More research covering non-R&D innovation activities is needed to better understand this situation.

Next, we have market capitalisation and number of patents. Companies with more patents and greater market capitalisation are more likely to undertake R&D. This aligns with the majority of academic research arguing that R&D outputs are typically protected through patents, and that the market values R&D intensity (as well as patents). However, this might be due to the lack of reliable metrics for non-R&D innovation, which forces the market to rely on R&D and patents.

The remaining three features in the top 10 predictors are headquarters state, number of employees and year founded. The difference across headquarters states may be attributable to the uneven location of businesses depending on the industry (e.g. mining and resource industries are more likely to be located in some states than others). With number of employees, we can see a large number of companies with 0-1 employees not undertaking R&D. These are likely to be junior mining companies, which have a greater focus on exploration and new discoveries than on R&D. Lastly, there is a group of relatively young companies that are less likely to undertake R&D, suggesting that these companies are still focusing on developing product/service lines and business processes before committing to further R&D expenditures.

There are also important insights regarding the features at the bottom of the importance ranking. The company and trading status of the companies in my sample, as well as features such as the number of products, segments, competitors and alliances, are less likely to play a role in R&D undertaking.

7. Conclusions

This study shows that whether or not a company undertakes R&D can be predicted using a battery of company structural and performance features. Regarding the factors driving R&D undertaking in Australia, a random forest-based classification model indicates that the industry in which the company operates is the most dominant factor influencing R&D. The amount of assets, revenues, market capitalisation and number of patents play a secondary, though noticeable, role as well.